Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Davidl root reader t directory #579

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

faustus123
Copy link

BEGINRELEASENOTES
Add ROOTReader::openTDirectory() to allow external management of TDirectory (e.g. TMemFile) to support streaming.
ENDRELEASENOTES

This is motivated by issue #565 . It makes it possible to write a stream source that sends data packets in the form of serialized TMemFile objects using ROOT's built-in serialization/deserialization functionality. Each TMemFile may contain a small number of events in the "events" tree and the complete metadata trees allowing the other standard ROOTReader methods to work. I have a working test case that reads events from an existing PODIO ROOT file and sends them this way via zmq messages. This couples with the ePIC recon code and works so there is a proof-of-principle that this works as needed.

A couple of notes on what was done:

  1. A TTree *m_metaTree member was added to take the place of the TChain in most of the calls for which a TChain is not actually needed. When a TChain is actually used, this just points to the TChain.

  2. The TChain was changed from being kept in a unique_ptr to being a direct member of the class. I suspect the unique_ptr trick was used because TChain has its copy and assignment constructors unavailable as private members of the TChain class and that causes some issues with storing these in std::map. I ran into other complications with trying to implement m_metaTree to support both cases (an externally supplied TTree or an internal TChain) I resolved it by explicitly deleting the copy and assignment constructors for ROOTReader which avoids the problem of ownership of the TChain. Removing the unique_ptr mechanism was then possible which simplified things in the class. There may have been other reasons for using unique_ptr which I don't know about.

  3. A new test was added read_frame_root_tdirectory which is almost a complete copy of read_frame_root_multiple. I can't claim to fully understand the testing configuration, but hopefully this is in line with what you would like.

Copy link
Collaborator

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this. It's a slightly different approach than I would have attempted for this, but I think it's also quite elegant as it re-uses quite a bit of existing machinery. Also summoning @hegner for a few more high level comments.

One question: How do you build the TMemFiles on the writing side? (Or plan to build)

My main concern for this approach is that this ties the streaming of Frames rather tightly to ROOTs TTree backend. I would have tried to serialize on the FrameData level, which should be possible without any dependency on a specific backend. However, since you have a proof-of-principle already for the full chain, your approach is potentially a bit easier to implement (at least on the podio side), since a lot of the heavy lifting is done else where. You probably have a better feeling for how involved the overall approach is.

Another somewhat smaller concern is performance. When I first implemented Frame I/O I didn't pay too much attention to the things that happen "only once", e.g. when opening a file, or when first setting up a category for reading the first entry. Hence, I crammed quite a bit of (potentially) unnecessary work into that since the overhead will be distributed in any case. I am not entirely sure how things look if this "overhead" happens repeatedly for just a few events. In case all of the events that you stream have the same metadata, and the same categories there might be a way to cache a bit more of this information between calls to openTDirectory in case performance becomes an issue. Until that is the case, I would prefer the current version, since has fewer assumptions and is more readable.


I can't claim to fully understand the testing configuration, but hopefully this is in line with what you would like.

Yeah, it's not the simplest setup, but I haven't yet found the time to clean it up a bit better, now that the legacy EventStore is gone which drove some of it's implementation. I think overall the current approach is OK, I would maybe try to reduce some code duplication, but we can also do that at the end.

@@ -189,54 +197,87 @@ std::vector<std::string> getAvailableCategories(TChain* metaChain) {
return brNames;
}

/// @brief Read version and data model from the m_metaTree
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring needs to be in the header file in our doxygen configuration.


#include "TFile.h"

int read_frames(podio::ROOTReader& reader) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (on a quick first glance) seems to more or less be a copy of the templated version in read_test.h. Maybe that can be adapted to split off a version with an already initialized reader (initialized == file / TDirectory opened).

@faustus123
Copy link
Author

One question: How do you build the TMemFiles on the writing side? (Or plan to build)

You can see a pair of programs that do the serialization/deserialization and zmq messaging here. These are independent of the PODIO library. They are specific to the ROOT file format of the ePIC simulated data files in that it assumes 4 TTrees with the event data being kept in "events".

https://github.com/JeffersonLab/SRO-RTDP/tree/davidl_podio2tcp/src/utilities/cpp/podio2tcp

How this is implemented on the ePIC side is via a JANA JEventSource plugin. A rough prototype of that can be seen here. It mostly copies the one already in EICrecon that uses ROOTReader::openFiles(). This will need a lot of cleanup, but the core functionality is there.

https://github.com/JeffersonLab/SRO-RTDP/blob/davidl_podio2tcp/src/utilities/cpp/podiostream/podiostreamSource.cc

My main concern for this approach is that this ties the streaming of Frames rather tightly to ROOTs TTree backend. I would have tried to serialize on the FrameData level, which should be possible without any dependency on a specific backend. However, since you have a proof-of-principle already for the full chain, your approach is potentially a bit easier to implement (at least on the podio side), since a lot of the heavy lifting is done else where. You probably have a better feeling for how involved the overall approach is.

I agree 100% that a better implementation will be to do this at the Frame level. This approach was quicker for me to implement and didn't require my digging too deep into the PODIO source. It will serve my short term goals until someone more familiar with PODIO can do it the right way. I also think the openTDirectory() still adds some flexibility that may be useful in some other cases.

Another somewhat smaller concern is performance. When I first implemented Frame I/O I didn't pay too much attention to the things that happen "only once", e.g. when opening a file, or when first setting up a category for reading the first entry. Hence, I crammed quite a bit of (potentially) unnecessary work into that since the overhead will be distributed in any case. I am not entirely sure how things look if this "overhead" happens repeatedly for just a few events. In case all of the events that you stream have the same metadata, and the same categories there might be a way to cache a bit more of this information between calls to openTDirectory in case performance becomes an issue. Until that is the case, I would prefer the current version, since has fewer assumptions and is more readable.

Yes, performance is not great, but it is currently obscured by the horribly slow ePIC reconstruction code. When reading from a file, it takes 2-5 seconds to reconstruct an event. Preliminary testing with pre-formed send buffers yields sending takes only about 0.02 seconds/event.

When it comes to reading in the meta-data, one knob we have to turn is how many events are sent at once. I can just as easily send 100 events or more at a time which does reduce that overhead. I'll tune that as needed when I start scaling up for more performance testing.

@tmadlener
Copy link
Collaborator

Thanks for the links. After a quick look and IIUC the main use case here is to read from stream in Jana2, but the source of the stream would then be outside of Jana2? I am mainly asking, because I was wondering whether you also planned to implement (or actually need) the creation of the TMemFile somewhere in memory. Currenty, it seems that everything starts from an existing podio file on disk rather.

@faustus123
Copy link
Author

Thanks for the links. After a quick look and IIUC the main use case here is to read from stream in Jana2, but the source of the stream would then be outside of Jana2? I am mainly asking, because I was wondering whether you also planned to implement (or actually need) the creation of the TMemFile somewhere in memory. Currenty, it seems that everything starts from an existing podio file on disk rather.

Correct, there is no need for a JANA2 process on the sending side.

I do not anticipate using TMemFile in any production streaming system. It is just the (relatively) easy way to serialize ROOT objects which is the form the simulated ePIC data happens to be in. My expectation is that the DAQ WG will decide on some custom format for the stream data driven by the customized nature of the electronics. We may deal with that either in a specific JANA2 JEventSource or by some future transformer code that converts it into a PODIO stream (whatever that design turns out to be).

It is true that at the moment and for the next couple of years we will likely only have simulated data and it will be in the form of podio files on disk. Hence the need to generate stream(s) from that so the SRO infrastructure can start development.

@faustus123
Copy link
Author

I haven't heard anything on this for 2 months. Are there any plans to merge it or any hold-ups preventing it?

@hegner
Copy link
Collaborator

hegner commented Jun 11, 2024

@faustus123 - apologies for having it dormant for so long. We've been doing some refactoring which we just managed to finish. If you can update your PR I can deal with it this week.

@hegner hegner self-requested a review June 11, 2024 19:31
@faustus123
Copy link
Author

Uggh. I just tried to rebase to bring my fork up to date and a lot of changes have happened in the ROOTReader files in the meantime. It is going to take a while to try and resolve this. It would probably be more efficient if I do this after you have finished with whatever work you are doing there. Please let me know when you are finished with the current round of work and I will take a look at it then.

@tmadlener
Copy link
Collaborator

The changes that we want to go in first are the ones in #625. Looking at the changes in this PR, I am wondering whether you cherry-picking might be easier than rebasing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants